Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refine publish notifications #271

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

qiluo-msft
Copy link
Contributor

This is a competing PR to #258

I reverted #214 since it exposed unnecessary implementation details.

Problem statement is same as #258, so I shamelessly copied it here.

The typical use case that this PR is trying to deal with is related to route flapping.

Assuming such a scenario:
<1> 4K routes learned on an interface.

<2> interface down, 4k routes withdraw.
<3> fpmsyncd inserts 4k route keys into producerStateTabk keyset and generates 4k redis publish notifications

<4> Interface goes up

<4.A>
Interface may go up and learn the 4k routes back before orchagent has chance to pick up any notification and process the keyset. That is fine, change in #257 handled this case, no new redis publish notification will be created since the route keys already exist in producerStateTabk keyset and are to be processed.

<4.B>
Interface may go up and learn the 4k routes back after orchagent has processed a few notifications but not all the redis notifications (say, 1k notification processed), with the first notification, consumer in orchagent batch popped all the 4k route keys from producerStateTabk keyset, (orchagent has set popBatchSize as 8192), for the other 999 notifications processed, orchagent just call consumer_state_table_pops.lua 999 times without seeing any real data.
4k - 1k = 3k more redis notifications pending in redis queue now.

<5> Interface goes down
<5.A> If <4.A> was hit during previous interface up, good, no new redis publish notification will be generated

<5.B> If <4.B> was hit during previous interface up. 4K more redis notifications put into the redis queue.
<6> Interface goes up, loop back to <4>

The exact scenarios vary, but from <4> to <6>, number of pending notifications keeps increasing.

The issue is that orchagent may be abled to handle all the route changes with the batch processing (popBatchSize = 8192), redis publish notification was generated for each key and orchagent has to respond to each of the notifications with current select framework though most of them bring no value.

The implementation proposed here follows some design principles.

  1. Generate one and only one Redis publish notification (channel message) when there is one data in the key set. So the number of messages and keys are the same, and we can ensure to decrease the same amount during pop/pops()
  2. The only exception to above is when producer calls clear(). The channel messages could not be recalled.
  3. Some users will prefer pop(), and others may prefer pops() to pop a batch. But no one will use both for the same consumer table
  4. Consumer will never publish to channel, ie. create channel message

@qiluo-msft
Copy link
Contributor Author

qiluo-msft commented Mar 10, 2019

@jipanyang @zhenggen-xu As we offline discussed I submit this PR. Please help review. Thanks for your solution and inspiration. #Closed

@qiluo-msft qiluo-msft changed the title Qiluo/producer state publish2 Refine publish notifications Mar 10, 2019
@qiluo-msft qiluo-msft requested review from pavel-shirshov, lguohan and kcudnik and removed request for pavel-shirshov and lguohan March 10, 2019 08:03
kcudnik
kcudnik previously approved these changes Mar 10, 2019
Copy link
Contributor

@jipanyang jipanyang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Compared with the existing implementation, changes in this PR optimized the processing of notification at Select() of consumerStateTable side.

We need to confirm if it is good enough to cope with the route flapping case seen previously. @zhenggen-xu

* This is no big harm since all the late messages will be selected and nothing popped
* when the channel is not completely busy.
*/
n -= m_queueLength;
Copy link
Contributor

@jipanyang jipanyang Apr 17, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any usage of "n" after this operation? #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. It will be optimized out by compiler. It may help readability a little bit.


In reply to: 276422184 [](ancestors = 276422184)

Copy link
Contributor

@jipanyang jipanyang Apr 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. It will be optimized out by compiler. It may help readability a little bit.

In reply to: 276422184 [](ancestors = 276422184)

I think it causes more confusion than helping readability. #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.


In reply to: 276515514 [](ancestors = 276515514)

* in the select-pop(s) pattern
*/
if (n <= 0)
n = 1;
Copy link
Contributor

@jipanyang jipanyang Apr 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In what scenario n <= 0?

#Resolved

Copy link
Contributor Author

@qiluo-msft qiluo-msft Apr 25, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am trying to prevent any logic error outside the scope of RedisSelect. Considering the case if there are redundant notifications but there is no data popped, caller may call discard(0), and here the function consumes one message in the queue if there is any. discard(negative) is not for protection purpose and should be not in a true case.


In reply to: 276515380 [](ancestors = 276515380)

* This is no big harm since all the late messages will be selected and nothing popped
* when the channel is not completely busy.
*/
n -= m_queueLength;
Copy link
Contributor

@jipanyang jipanyang Apr 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. It will be optimized out by compiler. It may help readability a little bit.

In reply to: 276422184 [](ancestors = 276422184)

I think it causes more confusion than helping readability. #Resolved

Signed-off-by: Qi Luo <qiluo-msft@users.noreply.github.com>
Signed-off-by: Qi Luo <qiluo-msft@users.noreply.github.com>
@qiluo-msft qiluo-msft force-pushed the qiluo/producerState_publish2 branch from 40e9e79 to 74778c0 Compare April 25, 2019 00:57
Copy link
Contributor

@jipanyang jipanyang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.

@lguohan
Copy link
Contributor

lguohan commented May 24, 2019

retest this please

Copy link
Contributor

@pavel-shirshov pavel-shirshov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have we made any tests checking the new implementation is faster than previous?
Other as comments

// Override with an empty body, and subclasses will explicitly discard messages in the channel
void ConsumerTableBase::updateAfterRead()
{
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why do we need this?
The base class already has an empty body

r = cs.isQueueEmpty();
EXPECT_TRUE(r);
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this test was removed?

bool ConsumerTableBase::hasCachedData()
{
return true;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The base class has the same implementation. It doesn't need to be implemented here.

@@ -88,7 +88,7 @@ bool SubscriberStateTable::hasData()

bool SubscriberStateTable::hasCachedData()
{
return m_buffer.size() > 1 || m_keyspace_event_buffer.size() > 1;
return true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's default implementation in the base class. We could remove whole method.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants